Linear Regression

Advance Analytics with R (UG 21-24)

Ayush Patel

Before we start

Please load the following packages:

library(tidyverse)
library(MASS)
library(ISLR)
library(ISLR2)



Access the lecture slides at bit.ly/aar-ug

Image: Warrior's armor (gusoku)
Source: Armor (Gusoku)

Hello

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics.

I am an RStudio (Posit) certified tidyverse instructor.

I am a researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objective

Learn to apply and interpret simple and multiple linear regression models.

References for this lecture:

  • Chapter 3, ISLR (reference)
  • Chapters 7 and 8, Intro to Modern Statistics (Reading for intuitive understanding)

Advertising Data

...1 TV radio newspaper sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75.0 7.2
7 57.5 32.8 23.5 11.8
8 120.2 19.6 11.6 13.2
9 8.6 2.1 1.0 4.8
10 199.8 2.6 21.2 10.6
11 66.1 5.8 24.2 8.6
12 214.7 24.0 4.0 17.4
13 23.8 35.1 65.9 9.2
14 97.5 7.6 7.2 9.7
15 204.1 32.9 46.0 19.0
16 195.4 47.7 52.9 22.4
17 67.8 36.6 114.0 12.5
18 281.4 39.6 55.8 24.4
19 69.2 20.5 18.3 11.3
20 147.3 23.9 19.1 14.6
21 218.4 27.7 53.4 18.0
22 237.4 5.1 23.5 12.5
23 13.2 15.9 49.6 5.6
24 228.3 16.9 26.2 15.5
25 62.3 12.6 18.3 9.7
26 262.9 3.5 19.5 12.0
27 142.9 29.3 12.6 15.0
28 240.1 16.7 22.9 15.9
29 248.8 27.1 22.9 18.9
30 70.6 16.0 40.8 10.5
31 292.9 28.3 43.2 21.4
32 112.9 17.4 38.6 11.9
33 97.2 1.5 30.0 9.6
34 265.6 20.0 0.3 17.4
35 95.7 1.4 7.4 9.5
36 290.7 4.1 8.5 12.8
37 266.9 43.8 5.0 25.4
38 74.7 49.4 45.7 14.7
39 43.1 26.7 35.1 10.1
40 228.0 37.7 32.0 21.5
41 202.5 22.3 31.6 16.6
42 177.0 33.4 38.7 17.1
43 293.6 27.7 1.8 20.7
44 206.9 8.4 26.4 12.9
45 25.1 25.7 43.3 8.5
46 175.1 22.5 31.5 14.9
47 89.7 9.9 35.7 10.6
48 239.9 41.5 18.5 23.2
49 227.2 15.8 49.9 14.8
50 66.9 11.7 36.8 9.7
51 199.8 3.1 34.6 11.4
52 100.4 9.6 3.6 10.7
53 216.4 41.7 39.6 22.6
54 182.6 46.2 58.7 21.2
55 262.7 28.8 15.9 20.2
56 198.9 49.4 60.0 23.7
57 7.3 28.1 41.4 5.5
58 136.2 19.2 16.6 13.2
59 210.8 49.6 37.7 23.8
60 210.7 29.5 9.3 18.4
61 53.5 2.0 21.4 8.1
62 261.3 42.7 54.7 24.2
63 239.3 15.5 27.3 15.7
64 102.7 29.6 8.4 14.0
65 131.1 42.8 28.9 18.0
66 69.0 9.3 0.9 9.3
67 31.5 24.6 2.2 9.5
68 139.3 14.5 10.2 13.4
69 237.4 27.5 11.0 18.9
70 216.8 43.9 27.2 22.3
71 199.1 30.6 38.7 18.3
72 109.8 14.3 31.7 12.4
73 26.8 33.0 19.3 8.8
74 129.4 5.7 31.3 11.0
75 213.4 24.6 13.1 17.0
76 16.9 43.7 89.4 8.7
77 27.5 1.6 20.7 6.9
78 120.5 28.5 14.2 14.2
79 5.4 29.9 9.4 5.3
80 116.0 7.7 23.1 11.0
81 76.4 26.7 22.3 11.8
82 239.8 4.1 36.9 12.3
83 75.3 20.3 32.5 11.3
84 68.4 44.5 35.6 13.6
85 213.5 43.0 33.8 21.7
86 193.2 18.4 65.7 15.2
87 76.3 27.5 16.0 12.0
88 110.7 40.6 63.2 16.0
89 88.3 25.5 73.4 12.9
90 109.8 47.8 51.4 16.7
91 134.3 4.9 9.3 11.2
92 28.6 1.5 33.0 7.3
93 217.7 33.5 59.0 19.4
94 250.9 36.5 72.3 22.2
95 107.4 14.0 10.9 11.5
96 163.3 31.6 52.9 16.9
97 197.6 3.5 5.9 11.7
98 184.9 21.0 22.0 15.5
99 289.7 42.3 51.2 25.4
100 135.2 41.7 45.9 17.2
101 222.4 4.3 49.8 11.7
102 296.4 36.3 100.9 23.8
103 280.2 10.1 21.4 14.8
104 187.9 17.2 17.9 14.7
105 238.2 34.3 5.3 20.7
106 137.9 46.4 59.0 19.2
107 25.0 11.0 29.7 7.2
108 90.4 0.3 23.2 8.7
109 13.1 0.4 25.6 5.3
110 255.4 26.9 5.5 19.8
111 225.8 8.2 56.5 13.4
112 241.7 38.0 23.2 21.8
113 175.7 15.4 2.4 14.1
114 209.6 20.6 10.7 15.9
115 78.2 46.8 34.5 14.6
116 75.1 35.0 52.7 12.6
117 139.2 14.3 25.6 12.2
118 76.4 0.8 14.8 9.4
119 125.7 36.9 79.2 15.9
120 19.4 16.0 22.3 6.6
121 141.3 26.8 46.2 15.5
122 18.8 21.7 50.4 7.0
123 224.0 2.4 15.6 11.6
124 123.1 34.6 12.4 15.2
125 229.5 32.3 74.2 19.7
126 87.2 11.8 25.9 10.6
127 7.8 38.9 50.6 6.6
128 80.2 0.0 9.2 8.8
129 220.3 49.0 3.2 24.7
130 59.6 12.0 43.1 9.7
131 0.7 39.6 8.7 1.6
132 265.2 2.9 43.0 12.7
133 8.4 27.2 2.1 5.7
134 219.8 33.5 45.1 19.6
135 36.9 38.6 65.6 10.8
136 48.3 47.0 8.5 11.6
137 25.6 39.0 9.3 9.5
138 273.7 28.9 59.7 20.8
139 43.0 25.9 20.5 9.6
140 184.9 43.9 1.7 20.7
141 73.4 17.0 12.9 10.9
142 193.7 35.4 75.6 19.2
143 220.5 33.2 37.9 20.1
144 104.6 5.7 34.4 10.4
145 96.2 14.8 38.9 11.4
146 140.3 1.9 9.0 10.3
147 240.1 7.3 8.7 13.2
148 243.2 49.0 44.3 25.4
149 38.0 40.3 11.9 10.9
150 44.7 25.8 20.6 10.1
151 280.7 13.9 37.0 16.1
152 121.0 8.4 48.7 11.6
153 197.6 23.3 14.2 16.6
154 171.3 39.7 37.7 19.0
155 187.8 21.1 9.5 15.6
156 4.1 11.6 5.7 3.2
157 93.9 43.5 50.5 15.3
158 149.8 1.3 24.3 10.1
159 11.7 36.9 45.2 7.3
160 131.7 18.4 34.6 12.9
161 172.5 18.1 30.7 14.4
162 85.7 35.8 49.3 13.3
163 188.4 18.1 25.6 14.9
164 163.5 36.8 7.4 18.0
165 117.2 14.7 5.4 11.9
166 234.5 3.4 84.8 11.9
167 17.9 37.6 21.6 8.0
168 206.8 5.2 19.4 12.2
169 215.4 23.6 57.6 17.1
170 284.3 10.6 6.4 15.0
171 50.0 11.6 18.4 8.4
172 164.5 20.9 47.4 14.5
173 19.6 20.1 17.0 7.6
174 168.4 7.1 12.8 11.7
175 222.4 3.4 13.1 11.5
176 276.9 48.9 41.8 27.0
177 248.4 30.2 20.3 20.2
178 170.2 7.8 35.2 11.7
179 276.7 2.3 23.7 11.8
180 165.6 10.0 17.6 12.6
181 156.6 2.6 8.3 10.5
182 218.5 5.4 27.4 12.2
183 56.2 5.7 29.7 8.7
184 287.6 43.0 71.8 26.2
185 253.8 21.3 30.0 17.6
186 205.0 45.1 19.6 22.6
187 139.5 2.1 26.6 10.3
188 191.1 28.7 18.2 17.3
189 286.0 13.9 3.7 15.9
190 18.7 12.1 23.4 6.7
191 39.5 41.1 5.8 10.8
192 75.5 10.8 6.0 9.9
193 17.2 4.1 31.6 5.9
194 166.8 42.0 3.6 19.6
195 149.7 35.6 6.0 17.3
196 38.2 3.7 13.8 7.6
197 94.2 4.9 8.1 9.7
198 177.0 9.3 6.4 12.8
199 283.6 42.0 66.2 25.5
200 232.1 8.6 8.7 13.4

Association between sales and budget?

How strong is the association, if any?

sales and TV

[1] 0.7822244

sales and radio

[1] 0.5762226

sales and newspaper

[1] 0.228299
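
The three numbers above look like output from `cor()`, although the slides do not show the call, so that is an assumption. A minimal sketch of the computation follows. To stay self-contained it uses only the first ten rows of the table above, stored in a hypothetical data frame named `advertisement_subset`, so its results will differ from the full-data values on the slide.

```r
# First ten rows of the Advertising table shown above.
# This is a small subset for illustration, not the full 200-row data set.
advertisement_subset <- data.frame(
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6)
)

# Pearson correlation between sales and each advertising budget
cor_with_sales <- sapply(
  advertisement_subset[c("TV", "radio", "newspaper")],
  function(budget) cor(advertisement_subset$sales, budget)
)

print(cor_with_sales)
```

With the full data loaded as `advertisement`, the analogous call would be `cor(advertisement$sales, advertisement$TV)`.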

Linear model

A linear model can help us answer questions about the association between the response and the predictors, the linearity of that relationship, and interactions between predictors. It also lets us predict future sales.

A simple linear model

\[ Y \approx \beta_0 + \beta_1X \]

\[\beta_0 \text{ is the population intercept}\]

\[\beta_1 \text{ is the population slope}\]

Our estimates are represented as:

\[ \hat\beta_0\] \[\hat\beta_1\]

How to reach the best estimate?

The idea is, essentially, to draw a line through the points such that the distance of each point from the line is as small as possible.

Least squares

One way to estimate the population coefficients (parameters) is least squares: choose the estimates that minimize the residual sum of squares (RSS).

\[\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}\]

\[\hat y_i = \hat\beta_0 + \hat\beta_1x_i\] \[e_i = y_i - \hat y_i\]

\[RSS = e_1^2 + e_2^2 + \dots + e_n^2\]

Minimize RSS

Least squares coefficient estimates

\[ \hat\beta_1 = \frac{\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n(x_i - \bar x)^2} \]

\[ \hat\beta_0 = \bar y - \hat\beta_1\bar x \]
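
The two formulas above can be checked by computing them directly and comparing with `lm()`. This is a minimal sketch that uses only the first ten rows of the Advertising table (TV and sales), so the estimates will not match the full-data fit shown in the slides.

```r
# First ten rows of TV and sales from the Advertising table (illustrative subset)
TV    <- c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8)
sales <- c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta1_hat <- sum((TV - mean(TV)) * (sales - mean(sales))) / sum((TV - mean(TV))^2)

# Intercept: the fitted line passes through (mean of x, mean of y)
beta0_hat <- mean(sales) - beta1_hat * mean(TV)

# Residuals e_i = y_i - y_hat_i and the residual sum of squares
residuals_by_hand <- sales - (beta0_hat + beta1_hat * TV)
rss <- sum(residuals_by_hand^2)

# lm() minimizes the same RSS, so its coefficients should agree
fit <- lm(sales ~ TV)
print(c(beta0_hat = beta0_hat, beta1_hat = beta1_hat))
print(coef(fit))
```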

The model


Call:
lm(formula = sales ~ TV, data = advertisement)

Coefficients:
(Intercept)           TV  
    7.03259      0.04754  

“For every additional $1,000 spent on the TV advertising budget, sales increase by about 47.5 units on average.”
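
The arithmetic behind this interpretation can be sketched as follows. It relies on the units used in ISLR for this data set (budgets in thousands of dollars, sales in thousands of units), which the slides do not restate. The TV budget of 100 used for the prediction is a hypothetical value chosen only for illustration.

```r
# Coefficients as printed in the lm() output above
beta0_hat <- 7.03259
beta1_hat <- 0.04754

# Assumption (ISLR convention): TV is in thousands of dollars,
# sales is in thousands of units.
# One extra unit of TV is $1,000; convert the sales change to single units.
extra_sales_units <- beta1_hat * 1 * 1000

# Predicted sales (in thousands of units) for a hypothetical TV budget of 100,
# that is, $100,000
predicted_sales <- beta0_hat + beta1_hat * 100

print(extra_sales_units)
print(predicted_sales)
```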

Exercise

Use the Auto data from {ISLR2}. Fit this model:

\[\text{horsepower} = \beta_0 + \beta_1 \times \text{weight} + \epsilon\]

Find the coefficient estimates \(\hat\beta_0\) and \(\hat\beta_1\), and the residuals.

How well did we estimate the coefficients?

\[\text{Compute the standard error of } \hat\beta_0 \text{ and } \hat\beta_1\]

Something like this, for the sample mean:

\[Var(\hat\mu) = SE(\hat\mu)^2 = \frac{\sigma^2}{n}\]

For the regression coefficients, the formulas are:

\[SE(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n}+\frac{\bar x^2}{\sum_{i=1}^n(x_i - \bar x)^2}\right]\hspace{2cm}SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar x)^2}\]

What is \(\sigma\) here?

\[\text{What happens when the } x_i \text{ are spread out?}\]
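
The standard error formulas above can be checked numerically. In practice \(\sigma\) is unknown and is estimated from the data by the residual standard error, \(\sqrt{RSS/(n-2)}\), as described in ISLR Chapter 3. The sketch below again uses only the first ten rows of the Advertising table, so the values differ from the full-data fit.

```r
# First ten rows of TV and sales from the Advertising table (illustrative subset)
TV    <- c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8)
sales <- c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6)
n     <- length(TV)

fit <- lm(sales ~ TV)

# sigma is unknown; estimate it with the residual standard error
rss       <- sum(resid(fit)^2)
sigma_hat <- sqrt(rss / (n - 2))

# Sum of squared deviations of x: larger when the x values are more spread out
sxx <- sum((TV - mean(TV))^2)

# Standard errors from the formulas, with sigma replaced by its estimate
se_beta0 <- sigma_hat * sqrt(1 / n + mean(TV)^2 / sxx)
se_beta1 <- sigma_hat / sqrt(sxx)

# Standard errors reported by summary(), for comparison
se_from_lm <- coef(summary(fit))[, "Std. Error"]
print(c(se_beta0 = se_beta0, se_beta1 = se_beta1))
print(se_from_lm)
```

Because `sxx` sits in the denominator of the slope's standard error, more spread-out \(x_i\) give a smaller \(SE(\hat\beta_1)\) for the same \(\sigma\).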

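We can use the standard errors for hypothesis testing. In practice, a t-statistic is used to test the null hypothesis that \(\beta_1 = 0\), i.e. that there is no association between X and Y.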

\[t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}\]


Call:
lm(formula = sales ~ TV, data = advertisement)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,    Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
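
The `t value` and `Pr(>|t|)` columns can be reproduced from the `Estimate` and `Std. Error` columns of the output above. A minimal sketch, with the numbers typed in from the printed summary rather than taken from a fitted model object:

```r
# Estimates and standard errors copied from the summary output above
estimate  <- c(intercept = 7.032594, TV = 0.047537)
std_error <- c(intercept = 0.457843, TV = 0.002691)
df_resid  <- 198  # residual degrees of freedom: n - 2 = 200 - 2

# t-statistic for the null hypothesis that each coefficient equals zero
t_value <- (estimate - 0) / std_error

# Two-sided p-value from the t distribution with 198 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(round(t_value, 2))
print(p_value < 2e-16)
```

Both p-values fall below the `2e-16` threshold that R prints, so the null hypothesis of no association between TV budget and sales is rejected.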